{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Efficiently serving Large Language Models in 2bit with `aqlm` and `vLLM`\n",
    "\n",
    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_vllm.ipynb\">\n",
    "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
    "</a>\n",
    "\n",
    "Welcome to this notebook that goes through the recent `aqlm` integration with the [`vLLM`](https://github.com/vllm-project/vllm) serving framework.\n",
    "\n",
    "To the best of our knowlendge, this is the most efficient way to run AQLM in high-performance production setting.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install vllm>=0.4.2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading the model\n",
    "\n",
    "The workflow doesn't have any major differences form the usual [quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html) workflow taught by `vLLM`.\n",
    "\n",
    "The only extra thing we should mention is that we recommend setting `enforce_eager=True` to not compile the CUDA graph because it introduces huge memory overheads undermining the quantization memory saving benefits."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from vllm import LLM, SamplingParams"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "llm = LLM(\n",
    "    model=\"ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16\", # An AQLM model checkpoint\n",
    "    enforce_eager=True,  # Don't compile the graph\n",
    "    gpu_memory_utilization=0.99,\n",
    "    max_model_len=1024,\n",
    ")\n",
    "tokenizer = llm.get_tokenizer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "conversations = tokenizer.apply_chat_template(\n",
    "    [{'role': 'user', 'content': 'Generate a poem about the sun in Spanish'}],\n",
    "    tokenize=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "outputs = llm.generate(\n",
    "    [conversations],\n",
    "    SamplingParams(\n",
    "        temperature=0.8,\n",
    "        top_p=0.9,\n",
    "        max_tokens=1024,\n",
    "        stop_token_ids=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids(\"<|eot_id|>\")],\n",
    "    ),\n",
    "    use_tqdm=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<|start_header_id|>assistant<|end_header_id|>\n",
      "\n",
      "¡Con gusto! Here's a poem about the sun in Spanish:\n",
      "\n",
      "Sol de oro, de luz me envuelve,\n",
      "En cada momento, me ilumina con calidez,\n",
      "Con tus rayos, me acaliento, me cálculo,\n",
      "Y me guía en el camino, en cada momento.\n",
      "\n",
      "Tu faz es el sol, que me hace sentir,\n",
      "Que todo es posible, todo es real,\n",
      "Que no hay nada que no pueda hacer,\n",
      "Y que mi vida es una eternidad, en tu luz.\n",
      "\n",
      "Sol de mi vida, sin ti no soy,\n",
      "Un ser sin luz, sin ti no soy,\n",
      "Y sin ti, el mundo es una oscuridad,\n",
      "Y sin ti, no hay vida, no hay vida.\n",
      "\n",
      "¡Sol de mi vida, te amo tanto!\n",
      "¡Sol de mi vida, que no te deje ir!\n",
      "¡Sol de mi vida, que siempre estés conmigo,\n",
      "Y que ilumines mi camino, y me guíes!\n",
      "\n",
      "Translation:\n",
      "\n",
      "Golden sun, you envelop me with light,\n",
      "In every moment, you illuminate me with warmth,\n",
      "With your rays, you calm me, you measure,\n",
      "And guide me on the path, in every moment.\n",
      "\n",
      "Your face is the sun, that makes me feel,\n",
      "That everything is possible, everything is real,\n",
      "That there's nothing that I cannot do,\n",
      "And that my life is an eternity, in your light.\n",
      "\n",
      "Sun of my life, without you, I am not,\n",
      "A being without light, without you, I am not,\n",
      "And without you, the world is darkness,\n",
      "And without you, there is no life, no life.\n",
      "\n",
      "Oh, sun of my life, I love you so much!\n",
      "Oh, sun of my life, don't leave me!\n",
      "Oh, sun of my life, always be with me,\n",
      "And guide me on my path, and lead me!\n",
      "\n",
      "Note: This is a personal poem, so it's not a traditional or classical poem, but it's a creative expression of the poet's feelings and thoughts about the sun.\n"
     ]
    }
   ],
   "source": [
    "print(outputs[0].outputs[0].text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Benchmarking\n",
    "\n",
    "Let us measure the generation speed.\n",
    "\n",
    "_Spoiler_: <details>\n",
    "  It's fast!\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tok/s: 54.03s\n"
     ]
    }
   ],
   "source": [
    "from time import perf_counter\n",
    "\n",
    "start = perf_counter()\n",
    "llm.generate(\n",
    "    [conversations],\n",
    "    SamplingParams(\n",
    "        min_tokens=128,\n",
    "        max_tokens=128,\n",
    "    ),\n",
    "    use_tqdm=False,\n",
    ")\n",
    "end = perf_counter()\n",
    "\n",
    "print(f\"Tok/s: {128/(end - start):.2f}\")"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "gpuClass": "standard",
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
